Skip to content

Conversation

@konstantinb
Copy link
Contributor

@konstantinb konstantinb commented Dec 16, 2025

What changes were proposed in this pull request?

HIVE-29367: fixing overflows in ConvertJoinMapJoin calculations

Why are the changes needed?

ConvertJoinMapJoin does not use StatsUtils.safeAdd()/saveMult() for all its calculations. There are some real life scenarios when it could perform a catastrophic decision to convert a join to a mapjoin after calculating negative size for the 'small" table, resulting in an OOM during query processing

Does this PR introduce any user-facing change?

No

How was this patch tested?

Via unit testing and with load testing on a custom Hive installation based of 4.0x version

You can see the test output generated by the pre-fix code here: HIVE-29637-mapjoin_stats_overflow q out original
it clearly confirms the decision of perform a mapjoin despite very large volume of data

@konstantinb konstantinb changed the title HIVE-29367: preventing Long overflows in ConvertJoinMapJoin HIVE-29367: prevent Long overflows in ConvertJoinMapJoin Dec 18, 2025
@konstantinb konstantinb marked this pull request as ready for review December 18, 2025 19:18
@sonarqubecloud
Copy link

}
Operator<? extends OperatorDesc> parentOp = joinOp.getParentOperators().get(pos);
totalSize += computeOnlineDataSize(parentOp.getStatistics());
totalSize = StatsUtils.safeAdd(totalSize, computeOnlineDataSize(parentOp.getStatistics()));
Copy link
Member

@deniskuzZ deniskuzZ Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's appropriate to use safeAdd for a table size?
on the other side hashTableDataSizeAdjustment does that as well, so I guess it's fine
cc @zabetak, @thomasrebele

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. The total size here does not need to be 100% correct, it's just an estimation that influences the join decision. Might make sense to rename it to estimatedTotalSize.

if (cs != null) {
String colTypeLowerCase = cs.getColumnType().toLowerCase();
long nonNullCount = cs.getNumNulls() > 0 ? numRows - cs.getNumNulls() + 1 : numRows;
long nonNullCount = cs.getNumNulls() > 0 ? Math.max(1L, numRows - cs.getNumNulls() + 1) : numRows;
Copy link
Member

@deniskuzZ deniskuzZ Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe Math.max(0L, numRows - cs.getNumNulls()) + 1

@deniskuzZ
Copy link
Member

@konstantinb, do we need same for
long llapMaxSize = (long) (maxSize + (maxSize * overSubscriptionFactor * slotsPerQuery))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants